In an ever-volatile market where every dollar counts, the used car market represents a critical sector of the consumer industry. With rising consumer demand and an increasing variety of vehicles entering the secondary market, understanding the factors that influence used car prices is essential. Buyers seek to make informed decisions based on value for money, while sellers aim to maximize returns by accurately pricing their vehicles. Bridging this gap requires a data-driven approach to uncover the relationships between vehicle specifications and market pricing.
This project utilizes the “Vehicle dataset” by Nehal Birla, Nishant Verma, and Nikhil Kushwaha, available on Kaggle.
This dataset aggregates detailed information on over 10,000 used cars, including key variables such as fuel type, transmission, engine capacity, mileage, and kilometers driven, alongside categorical variables like seller type, ownership history, and geographic location.
Our analysis is driven by three core objectives:
Using statistical modeling techniques, including regression analysis, we aim to deliver a robust and interpretable model that not only predicts car prices but also highlights the most influential factors driving price variations. Our results will shed light on market dynamics, offering actionable insights for both consumers and industry professionals navigating this volatile space.
By the conclusion of this project, we aim to provide a detailed analysis that enhances understanding of the used car market, aiding stakeholders in making informed decisions in an ever-changing economic landscape.
The dataset contains the following attributes:
| Attribute Name | Description | |
|---|---|---|
| 1 | name |
The make and model of the vehicle (e.g., Hyundai i10, Honda City). |
| 2 | year |
The year the vehicle was manufactured. |
| 3 | selling_price |
The selling price of the vehicle in Indian Rupees (INR). |
| 4 | km_driven |
The total distance the vehicle has been driven (in km). |
| 5 | fuel |
The type of fuel used by the vehicle (“Diesel”, “Petrol”, “LPG”, or “CNG”). |
| 6 | seller_type |
The type of seller (“Individual”, “Dealer” or “Trustmark Dealer”). |
| 7 | transmission |
The type of transmission system (“Manual” or “Automatic”). |
| 8 | owner |
The number of previous owners of the vehicle (“First Owner”, “Second Owner”, “Third Owner”, “Fourth and Above Owner”, or “Test Drive Car”). |
| 9 | mileage |
The fuel efficiency of the vehicle (in either km/l or km/kg). |
| 10 | engine |
The engine displacement capacity (in CC). |
| 11 | max_power |
The maximum power output of the vehicle’s engine, measured in brake horsepower (bhp). |
| 12 | torque |
A pair of torque and RPM values representing the max torque of the vehicle. The torque values are in either Nm or kgm, and the RPM values are either values or value ranges. |
| 13 | seats |
The seating capacity of the vehicle as an integer. |
The above attributes can be categorized into numeric and categorical variables. Numeric variables can be separated into discrete and continuous variables.
| Attribute Type | Attribute Name |
|---|---|
| Categorical Variables | name, fuel, seller_type,
transmission, owner |
| Discrete Numeric Variables | year, km_driven, seats |
| Continuous Numeric Variables | selling_price, mileage,
engine, max_power, torque |
# Set document-wide options
options(tibble.width = Inf) # Make tibbles display all columns
car_details = read_csv("Car details v3.csv")
nrow(car_details)
## [1] 8128
head(car_details)
## # A tibble: 6 × 13
## name year selling_price km_driven fuel seller_type
## <chr> <dbl> <dbl> <dbl> <chr> <chr>
## 1 Maruti Swift Dzire VDI 2014 450000 145500 Diesel Individual
## 2 Skoda Rapid 1.5 TDI Ambition 2014 370000 120000 Diesel Individual
## 3 Honda City 2017-2020 EXi 2006 158000 140000 Petrol Individual
## 4 Hyundai i20 Sportz Diesel 2010 225000 127000 Diesel Individual
## 5 Maruti Swift VXI BSIII 2007 130000 120000 Petrol Individual
## 6 Hyundai Xcent 1.2 VTVT E Plus 2017 440000 45000 Petrol Individual
## transmission owner mileage engine max_power
## <chr> <chr> <chr> <chr> <chr>
## 1 Manual First Owner 23.4 kmpl 1248 CC 74 bhp
## 2 Manual Second Owner 21.14 kmpl 1498 CC 103.52 bhp
## 3 Manual Third Owner 17.7 kmpl 1497 CC 78 bhp
## 4 Manual First Owner 23.0 kmpl 1396 CC 90 bhp
## 5 Manual First Owner 16.1 kmpl 1298 CC 88.2 bhp
## 6 Manual First Owner 20.14 kmpl 1197 CC 81.86 bhp
## torque seats
## <chr> <dbl>
## 1 190Nm@ 2000rpm 5
## 2 250Nm@ 1500-2500rpm 5
## 3 12.7@ 2,700(kgm@ rpm) 5
## 4 22.4 kgm at 1750-2750rpm 5
## 5 11.5@ 4,500(kgm@ rpm) 5
## 6 113.75nm@ 4000rpm 5
car_details = na.omit(car_details)
nrow(car_details)
## [1] 7906
sum(duplicated(car_details))
## [1] 1189
car_details = car_details[!duplicated(car_details),]
nrow(car_details)
## [1] 6717
name variable# Extract the first word from "name" to get the make of the vehicle
car_details$name = word(car_details$name,1)
# Rename "name" variable to "make"
colnames(car_details)[1] = "make"
# Change datatypes of "make" from character to factor
car_details$make = as.factor(car_details$make)
mileage variable# View raw data for "mileage" variable
head(car_details$mileage, n=20)
## [1] "23.4 kmpl" "21.14 kmpl" "17.7 kmpl" "23.0 kmpl" "16.1 kmpl"
## [6] "20.14 kmpl" "17.3 km/kg" "16.1 kmpl" "23.59 kmpl" "20.0 kmpl"
## [11] "19.01 kmpl" "17.3 kmpl" "19.3 kmpl" "18.9 kmpl" "18.15 kmpl"
## [16] "24.52 kmpl" "23.0 kmpl" "19.7 kmpl" "22.54 kmpl" "21.0 kmpl"
# Drop 86 rows where "mileage" value contains the text "km/kg"
car_details <- car_details[!grepl("km/kg", car_details$mileage), ]
# Extract the numeric value from "mileage"
car_details$mileage = word(car_details$mileage,1)
# Change data_type of "mileage" from character to numeric
car_details$mileage=as.numeric(car_details$mileage)
# Rename "mileage" to "fuel_efficiency"
colnames(car_details)[which(names(car_details) == "mileage")] = "fuel_efficiency"
# View new data for "fuel_efficiency" variable
head(car_details$fuel_efficiency, n=20)
## [1] 23.40 21.14 17.70 23.00 16.10 20.14 16.10 23.59 20.00 19.01 17.30 19.30
## [13] 18.90 18.15 24.52 23.00 19.70 22.54 21.00 25.50
Above, the mileage variable is renamed to
fuel_efficiency to prevent confusion with the
km_driven variable, which represents the quantity normally
referred to as vehicle “mileage”.
engine variable# View raw data for "engine" variable
head(car_details$engine, n=20)
## [1] "1248 CC" "1498 CC" "1497 CC" "1396 CC" "1298 CC" "1197 CC" "796 CC"
## [8] "1364 CC" "1399 CC" "1461 CC" "993 CC" "1248 CC" "1061 CC" "1198 CC"
## [15] "1248 CC" "1396 CC" "796 CC" "1396 CC" "1461 CC" "1498 CC"
# Extract the numeric value from each entry (remove the text "CC" from the end of each entry)
car_details$engine = word(car_details$engine,1)
# Change data_type of "engine" from character to numeric
car_details$engine=as.numeric(car_details$engine)
# View new data for "engine" variable
head(car_details$engine, n=20)
## [1] 1248 1498 1497 1396 1298 1197 796 1364 1399 1461 993 1248 1061 1198 1248
## [16] 1396 796 1396 1461 1498
max_power variable# Transform "max_power" variable
# View raw data for "max_power" variable
head(car_details$max_power, n=20)
## [1] "74 bhp" "103.52 bhp" "78 bhp" "90 bhp" "88.2 bhp"
## [6] "81.86 bhp" "37 bhp" "67.1 bhp" "68.1 bhp" "108.45 bhp"
## [11] "60 bhp" "73.9 bhp" "67 bhp" "82 bhp" "88.5 bhp"
## [16] "90 bhp" "46.3 bhp" "88.73 bhp" "64.1 bhp" "98.6 bhp"
# Extract the numeric value from each entry of "max_power" (remove the text "bhp" from the end of each entry)
car_details$max_power = word(car_details$max_power,1)
# Change data_type of "max_power" from character to numeric
car_details$max_power=as.numeric(car_details$max_power)
# View new data for "max_power" variable
head(car_details$max_power, n=20)
## [1] 74.00 103.52 78.00 90.00 88.20 81.86 37.00 67.10 68.10 108.45
## [11] 60.00 73.90 67.00 82.00 88.50 90.00 46.30 88.73 64.10 98.60
# Change datatypes of "fuel", "seller_type", "transmission", and "owner"
# from character to factor
car_details$fuel=as.factor(car_details$fuel)
car_details$seller_type=as.factor(car_details$seller_type)
car_details$transmission=as.factor(car_details$transmission)
car_details$owner=as.factor(car_details$owner)
km_driven to
km_driven_in_10k# Modify the "km_driven" column by dividing the values of each entry by 10,000 and renaming the column to "km_driven_in_10k"
car_details$km_driven = car_details$km_driven/10000
# Rename km_driven
colnames(car_details)[4]="km_driven_in_10k"
# View data for "km_driven_in_10k" column
head(car_details)[4]
## # A tibble: 6 × 1
## km_driven_in_10k
## <dbl>
## 1 14.6
## 2 12
## 3 14
## 4 12.7
## 5 12
## 6 4.5
selling_price to
selling_price_in_10k# Modify "selling_price" column by dividing the values of each entry by 10,000 and renaming the column to "selling_price_in_10k"
car_details$selling_price = car_details$selling_price/10000
# Rename selling_price to "selling_price_in_10k"
colnames(car_details)[3]="selling_price_in_10k"
# View data for "selling_price_in_10k" column
head(car_details)[3]
## # A tibble: 6 × 1
## selling_price_in_10k
## <dbl>
## 1 45
## 2 37
## 3 15.8
## 4 22.5
## 5 13
## 6 44
torque column# Drop "torque" column
car_details = car_details[,!names(car_details) %in% "torque"]
make_category variable from make
variableunique(car_details$make)
## [1] Maruti Skoda Honda Hyundai Toyota
## [6] Ford Renault Mahindra Tata Chevrolet
## [11] Datsun Jeep Mercedes-Benz Mitsubishi Audi
## [16] Volkswagen BMW Nissan Lexus Jaguar
## [21] Land MG Volvo Daewoo Kia
## [26] Fiat Force Ambassador Ashok Isuzu
## [31] Opel
## 31 Levels: Ambassador Ashok Audi BMW Chevrolet Daewoo Datsun Fiat ... Volvo
# Create new column "make_category" from "make" column
car_details$make_category = ifelse(car_details$make %in%
c("Maruti", "Tata", "Mahindra", "Datsun",
"Renault", "Chevrolet", "Fiat",
"Daewoo", "Ambassador", "Ashok"),
"Budget",
ifelse(car_details$make %in%
c("Honda", "Ford", "Hyundai","Toyota", "Volkswagen","Nissan",
"Skoda", "Mitsubishi","Force", "Kia","MG"),
"Midrange",
"Luxury"))
# Convert make_category to a factor
car_details$make_category = as.factor(car_details$make_category)
# Output all levels of the make_category variable
levels(car_details$make_category)
## [1] "Budget" "Luxury" "Midrange"
# Create a summary table of makes and their categories and view the mapping
make_category_mapping <- unique(car_details[, c("make", "make_category")])
make_category_mapping_df <- as.data.frame(make_category_mapping)
print(make_category_mapping_df, row.names = FALSE)
## make make_category
## Maruti Budget
## Skoda Midrange
## Honda Midrange
## Hyundai Midrange
## Toyota Midrange
## Ford Midrange
## Renault Budget
## Mahindra Budget
## Tata Budget
## Chevrolet Budget
## Datsun Budget
## Jeep Luxury
## Mercedes-Benz Luxury
## Mitsubishi Midrange
## Audi Luxury
## Volkswagen Midrange
## BMW Luxury
## Nissan Midrange
## Lexus Luxury
## Jaguar Luxury
## Land Luxury
## MG Midrange
## Volvo Luxury
## Daewoo Budget
## Kia Midrange
## Fiat Budget
## Force Midrange
## Ambassador Budget
## Ashok Budget
## Isuzu Luxury
## Opel Luxury
car_details dataset| Attribute Name | Description | Variable Type |
|---|---|---|
make |
The manufacturer of the vehicle (e.g., Maruti, Hyundai, Honda, etc.). | Factor |
year |
The year the vehicle was manufactured. | Numeric |
selling_price_in_10k |
The selling price of the vehicle in ten-thousands of Indian Rupees (INR). | Numeric |
km_driven_in_10k |
The total distance the vehicle has been driven, measured in ten-thousands of kilometers. | Numeric |
fuel |
The type of fuel used by the vehicle (“Diesel”, “Petrol”, “LPG”, or “CNG”). | Factor |
seller_type |
The type of seller (“Individual”, “Dealer”, or “Trustmark Dealer”). | Factor |
transmission |
The type of transmission system (“Manual” or “Automatic”). | Factor |
owner |
The number of previous owners of the vehicle (“First Owner”, “Second Owner”, “Third Owner”, “Fourth and Above Owner”, or “Test Drive Car”). | Factor |
fuel_efficiency |
The fuel efficiency of the vehicle in km/l. | Numeric |
engine |
The engine displacement capacity of the vehicle, measured in cubic centimeters (CC). | Numeric |
max_power |
The maximum power output of the vehicle’s engine, measured in brake horsepower (BHP). | Numeric |
seats |
The seating capacity of the vehicle. | Numeric |
make_category |
Categorized vehicle make (e.g., “Luxury”, “Midrange”, or “Budget”). | Factor |
head(car_details, n=8)
## # A tibble: 8 × 13
## make year selling_price_in_10k km_driven_in_10k fuel seller_type
## <fct> <dbl> <dbl> <dbl> <fct> <fct>
## 1 Maruti 2014 45 14.6 Diesel Individual
## 2 Skoda 2014 37 12 Diesel Individual
## 3 Honda 2006 15.8 14 Petrol Individual
## 4 Hyundai 2010 22.5 12.7 Diesel Individual
## 5 Maruti 2007 13 12 Petrol Individual
## 6 Hyundai 2017 44 4.5 Petrol Individual
## 7 Maruti 2001 4.5 0.5 Petrol Individual
## 8 Toyota 2011 35 9 Diesel Individual
## transmission owner fuel_efficiency engine max_power seats make_category
## <fct> <fct> <dbl> <dbl> <dbl> <dbl> <fct>
## 1 Manual First Owner 23.4 1248 74 5 Budget
## 2 Manual Second Owner 21.1 1498 104. 5 Midrange
## 3 Manual Third Owner 17.7 1497 78 5 Midrange
## 4 Manual First Owner 23 1396 90 5 Midrange
## 5 Manual First Owner 16.1 1298 88.2 5 Budget
## 6 Manual First Owner 20.1 1197 81.9 5 Midrange
## 7 Manual Second Owner 16.1 796 37 4 Budget
## 8 Manual First Owner 23.6 1364 67.1 5 Midrange
car_details$selling_price_in_10k[car_details$selling_price_in_10k > 720]
## [1] 1000
car_details=subset(car_details,subset = car_details$selling_price_in_10k < 720,)
selling_price = 1000.# Histogram of selling prices
hist(car_details$selling_price_in_10k,xlab="Selling Price (in 10,000 INR)",
main="Selling Price distribution",
breaks = 30,
col = "lightblue")
make_counts = table(car_details$make)
fuel_type_car_count = table(car_details$fuel)
seller_type_car_count = table(car_details$seller_type)
trans_type_car_count = table(car_details$transmission)
owner_type_car_count = table(car_details$owner)
par(mfrow = c(2, 3))
# Plot 1: Number of cars by make
barplot(sort(make_counts),horiz=TRUE, las = 1,
xlab = "Number of Cars",
ylab = "Car Make",
col = "lightblue",
cex.names = 0.5,
main = "Num. of Cars in Each Make")
# Plot 2: Number of cars by fuel type
barplot(sort(fuel_type_car_count), horiz = TRUE, las = 1, cex.names = 0.9,
xlab = "Number of Cars",col = "lightblue", main = "Num. of Cars in Each Fuel Type")
# Plot 3: Number of cars by seller type
barplot(sort(seller_type_car_count), horiz = TRUE, las = 1, cex.names = 0.8,
xlab = "Number of Cars",col = "lightblue", main = "Num. of Cars in Each Seller Type")
# Plot 4: Number of cars by transmission type
barplot(sort(trans_type_car_count), horiz = TRUE, las = 1, cex.names = 0.8,
xlab = "Number of Cars",col = "lightblue", main = "Num. of Cars in Each Trans Type")
# Plot 5: Number of cars by owner type
barplot(sort(owner_type_car_count), horiz = TRUE, las = 1, cex.names = 0.6,
xlab = "Number of Cars",col = "lightblue", main = "Num. of Cars in Each Owner Type")
# Boxplot of selling price by car make
boxplot(selling_price_in_10k ~ make, data = car_details,col=rainbow(length(unique(car_details$make))),
las = 2,cex.axis = 0.7,
main = "Boxplot of Selling Price by Car Make",
xlab = "Car Make",
ylab = "Selling Price (10,000 INR)")
# Boxplot of selling price by fuel type
boxplot(selling_price_in_10k ~ fuel, data = car_details, cex.axis = 0.8,
col = rainbow(length(unique(car_details$fuel))),
main = "Box Plot of Selling Price by Fuel Type",
xlab = "Fuel Type", ylab = "Selling Price (10,000 INR)")
# Boxplot of selling price by seller type
boxplot(selling_price_in_10k ~ seller_type, data = car_details, cex.axis = 0.6,
col = rainbow(length(unique(car_details$seller_type))),
main = "Box Plot of Selling Price by Seller Type",
xlab = "Seller Type", ylab = "Selling Price (10,000 INR)")
# Boxplot of selling price by transmission type
plot(selling_price_in_10k ~ transmission, data = car_details,
col = rainbow(length(unique(car_details$transmission))),
main = "Boxplot of Selling Price by Transmission Type",
xlab = "Transmission Type",
ylab = "Selling Price (10,000 INR)")
# Boxplot of selling price by owner type
plot(selling_price_in_10k ~ owner, data = car_details, las = 2, cex.axis = 0.5,
col = rainbow(length(unique(car_details$owner))),
main = "Boxplot of Selling Price vs Owner Type",
xlab = "Owner Type",
ylab = "Selling Price (10,000 INR)")
# Scatter plot of selling price vs year and transmission type
colours = ifelse(car_details$transmission == "Automatic", "blue", "red")
plot(selling_price_in_10k ~ year,data=car_details,col=colours,pch=19,
main = "Selling Price vs Year and Transmission Type",
xlab = "Year",
ylab = "Selling Price (10,000 INR)")
legend("topleft", legend = c("Automatic", "Manual"),
col = c("blue", "red"), pch = 19)
# Scatter plot of selling price vs year and make category
colours = c("Budget" = "blue", "Midrange"="green","Luxury"= "orange")
plot(selling_price_in_10k ~ year,data=car_details,
col = colours[car_details$make_category],pch=19,
main = "Selling Price vs Year and Make Category",
xlab = "Year",
ylab = "Selling Price (10,000 INR)")
legend("topleft", legend = c("Budget","Midrange","Luxury"),
col = c("blue", "green","orange"), pch = 19)
# Scatter plot of selling price vs km driven
plot(selling_price_in_10k ~ km_driven_in_10k, data=car_details,
xlab="Km Driven (10,000 km)",
ylab="Selling Price (10,000 INR)",
main="Selling Price vs Km Driven")
# Subset out extreme values
car_details = subset(car_details,car_details$km_driven_in_10k < 100,)
# Scatter plot of selling price vs km driven after removing extreme values
plot(selling_price_in_10k ~ km_driven_in_10k, data=car_details,
xlab="Km Driven (10,000 km)",
ylab="Selling Price (10,000 INR)",
main="Selling Price vs Km Driven (extreme values removed)")
# Subset out 15 observations with fuel_efficiency=0
car_details = subset(car_details, car_details$fuel_efficiency!=0,)
# Scatter plot of selling price vs fuel efficiency
plot(selling_price_in_10k ~ fuel_efficiency, data=car_details,
xlab="Fuel Efficiency (km/l)",
ylab="Selling Price (10,000 INR)",
main="Selling Price vs Fuel Efficiency")
# Scatter plot of selling price vs max power (by fuel type)
colours = c("Petrol" = "blue", "Diesel"="green")
plot(selling_price_in_10k ~ max_power, data=car_details,
col=colours[car_details$fuel],
main="Selling Price vs Max Power (by Fuel Type)",
xlab="Max Power (bhp)",
ylab="Selling Price (10,000 INR)" )
legend("topleft", legend = c("Petrol","Diesel"),
col = c("blue", "green"), pch = 19)
# Pairs plot for numeric variables
pairs(selling_price_in_10k ~ year + km_driven_in_10k + fuel_efficiency + engine + max_power + seats, data = car_details)
# Calculate correlation matrix for numeric variables
cor_mat = cor(car_details[, sapply(car_details, is.numeric)])
cor_mat
## year selling_price_in_10k km_driven_in_10k
## year 1.00000000 0.4395767 -0.45496390
## selling_price_in_10k 0.43957666 1.0000000 -0.19812586
## km_driven_in_10k -0.45496390 -0.1981259 1.00000000
## fuel_efficiency 0.37352047 -0.1258892 -0.24186898
## engine -0.02333154 0.4480476 0.30860425
## max_power 0.15993052 0.6857687 0.04886655
## seats 0.02630866 0.1648269 0.24955889
## fuel_efficiency engine max_power seats
## year 0.3735205 -0.02333154 0.15993052 0.02630866
## selling_price_in_10k -0.1258892 0.44804759 0.68576871 0.16482693
## km_driven_in_10k -0.2418690 0.30860425 0.04886655 0.24955889
## fuel_efficiency 1.0000000 -0.59909460 -0.40078564 -0.48552947
## engine -0.5990946 1.00000000 0.68399138 0.66312580
## max_power -0.4007856 0.68399138 1.00000000 0.25911446
## seats -0.4855295 0.66312580 0.25911446 1.00000000
# Extract high correlation values from the upper triangle of the correlation
# matrix (to remove redundant correlations caused by symmetry)
high_cor_indices = which(upper.tri(cor_mat) & cor_mat > 0.5, arr.ind = TRUE)
# Create table for high-correlation variables
high_cor_df = data.frame(
row = rownames(cor_mat)[high_cor_indices[, 1]],
column = colnames(cor_mat)[high_cor_indices[, 2]],
correlation = cor_mat[high_cor_indices]
)
# Display the first few entries in the table
head(high_cor_df)
## row column correlation
## 1 selling_price_in_10k max_power 0.6857687
## 2 engine max_power 0.6839914
## 3 engine seats 0.6631258
max_power and selling_price have a high
correlation of 0.6857687. Hence, selling price can increases with max
power.max_power and engine have a high
correlation of 0.6839914. This indicates that engine displacement and
max_power are collinear.seats and engine have high a correlation
of 0.6631258. This indicates that engine displacement and seats have
high collinearity.In this section, a model for predicting values of
selling_price based on the values of the other variables is
developed and the performance of the model is analyzed.
set.seed(125)
train_indices = sample(nrow(car_details), size = 0.80 * nrow(car_details))
train_data = car_details[train_indices, ]
test_data = car_details[-train_indices, ]
# Remove `make` from train_data and test_data
#There are 31 unique values for the field "make" and hence there will be at least 30 dummy variables if "make" is used in model building. This increases the complexity of the model and hence a broader category "make_category" is used.
train_data <- subset(train_data, select = -make)
test_data <- subset(test_data, select = -make)
nrow(train_data)
## [1] 5289
nrow(test_data)
## [1] 1323
# Definition of "full" linear model
full_model = lm(selling_price_in_10k ~ ., data = train_data)
# Model summary
summary(full_model)
##
## Call:
## lm(formula = selling_price_in_10k ~ ., data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -212.47 -11.13 -1.44 8.25 321.72
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.830e+03 2.819e+02 -27.777 < 2e-16 ***
## year 3.902e+00 1.414e-01 27.608 < 2e-16 ***
## km_driven_in_10k -8.666e-01 9.579e-02 -9.047 < 2e-16 ***
## fuelPetrol -9.928e+00 1.202e+00 -8.261 < 2e-16 ***
## seller_typeIndividual -8.391e+00 1.298e+00 -6.462 1.12e-10 ***
## seller_typeTrustmark Dealer -5.155e+00 5.535e+00 -0.931 0.35168
## transmissionManual -1.522e+01 1.542e+00 -9.869 < 2e-16 ***
## ownerFourth & Above Owner -2.120e+00 2.663e+00 -0.796 0.42610
## ownerSecond Owner -4.924e+00 9.240e-01 -5.328 1.03e-07 ***
## ownerTest Drive Car 2.483e+02 1.217e+01 20.405 < 2e-16 ***
## ownerThird Owner -3.582e+00 1.539e+00 -2.328 0.01995 *
## fuel_efficiency -3.249e-01 1.809e-01 -1.796 0.07249 .
## engine 9.718e-03 1.811e-03 5.366 8.41e-08 ***
## max_power 4.635e-01 2.086e-02 22.224 < 2e-16 ***
## seats 1.160e+00 6.149e-01 1.886 0.05935 .
## make_categoryLuxury 1.034e+02 3.115e+00 33.195 < 2e-16 ***
## make_categoryMidrange 2.517e+00 8.482e-01 2.968 0.00301 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26.9 on 5272 degrees of freedom
## Multiple R-squared: 0.7122, Adjusted R-squared: 0.7113
## F-statistic: 815.4 on 16 and 5272 DF, p-value: < 2.2e-16
#Number of parameters
length(coef(full_model))
## [1] 17
#Leave one out cross validation RMSE
sqrt(mean((resid(full_model) / (1 - hatvalues(full_model))) ^ 2))
## [1] 27.30411
# Diagnostic plots
par(mfrow = c(1, 2))
plot(full_model,which=c(1,2))
# Diagnostic tests
bp_test <- bptest(full_model)
print(bp_test)
##
## studentized Breusch-Pagan test
##
## data: full_model
## BP = 1310, df = 16, p-value < 2.2e-16
# Subset residuals for Shapiro-Wilk test (up to 5000 samples)
residual_subset = sample(residuals(full_model), size = 5000)
shapiro_test = shapiro.test(residual_subset)
print(shapiro_test)
##
## Shapiro-Wilk normality test
##
## data: residual_subset
## W = 0.71385, p-value < 2.2e-16
# Multicollinearity test
vif(full_model)
## GVIF Df GVIF^(1/(2*Df))
## year 2.184945 1 1.478156
## km_driven_in_10k 1.643485 1 1.281985
## fuel 2.603160 1 1.613431
## seller_type 1.117854 2 1.028244
## transmission 1.404885 1 1.185278
## owner 1.407956 4 1.043695
## fuel_efficiency 3.679447 1 1.918188
## engine 5.853557 1 2.419412
## max_power 3.175790 1 1.782075
## seats 2.684514 1 1.638449
## make_category 1.929111 2 1.178526
There are 17 parameters used in the full additive linear model, which is relatively high.
The R^2 value of the model is 0.7122. That is 71.22% of the variance in selling_price_in_10k is explained by the predictors.
All predictors have a Variance Inflation Factor (VIF) < 5, indicating no multicollinearity among the predictors.
The Breusch-Pagan failed with test statistic value of BP = 1310 and very low p_value (< 2.2e-16), indicating that the null hypothesis is rejected and the constant variance assumption is violated
The Shapiro-Wilk normality test is violated with the test statistic value of W= 0.71385 and very low p_value (< 2.2e-16) indicating that the null hypothesis is rejected and the normality assumption is violated
The LOOCV RMSE of the model is 27.30411, which is low when compared to the range of values for the response variable selling price.
All the individual predictors are significant except for “seller_type”(p_value=0.372486) and “seats” (p_value=0.888617)
small_add_model = lm(selling_price_in_10k ~ . -seller_type -seats , data = train_data)
# Model summary
summary(small_add_model)
##
## Call:
## lm(formula = selling_price_in_10k ~ . - seller_type - seats,
## data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -211.71 -11.03 -1.54 8.09 326.48
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.953e+03 2.722e+02 -29.214 < 2e-16 ***
## year 3.963e+00 1.361e-01 29.129 < 2e-16 ***
## km_driven_in_10k -9.068e-01 9.577e-02 -9.468 < 2e-16 ***
## fuelPetrol -1.032e+01 1.194e+00 -8.641 < 2e-16 ***
## transmissionManual -1.583e+01 1.536e+00 -10.307 < 2e-16 ***
## ownerFourth & Above Owner -2.772e+00 2.670e+00 -1.038 0.29924
## ownerSecond Owner -5.537e+00 9.231e-01 -5.998 2.13e-09 ***
## ownerTest Drive Car 2.538e+02 1.219e+01 20.822 < 2e-16 ***
## ownerThird Owner -4.388e+00 1.540e+00 -2.850 0.00439 **
## fuel_efficiency -3.962e-01 1.725e-01 -2.297 0.02163 *
## engine 1.111e-02 1.656e-03 6.708 2.17e-11 ***
## max_power 4.640e-01 2.035e-02 22.800 < 2e-16 ***
## make_categoryLuxury 1.031e+02 3.063e+00 33.666 < 2e-16 ***
## make_categoryMidrange 2.429e+00 8.085e-01 3.004 0.00268 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 27.01 on 5275 degrees of freedom
## Multiple R-squared: 0.7097, Adjusted R-squared: 0.709
## F-statistic: 992.1 on 13 and 5275 DF, p-value: < 2.2e-16
#Number of parameters
length(coef(small_add_model))
## [1] 14
#Leave one out cross validation RMSE
sqrt(mean((resid(small_add_model) / (1 - hatvalues(small_add_model))) ^ 2))
## [1] 27.39708
# Diagnostic plots
par(mfrow = c(1, 2))
plot(small_add_model,which=c(1,2))
# Diagnostic tests
bptest(small_add_model)
##
## studentized Breusch-Pagan test
##
## data: small_add_model
## BP = 1308.8, df = 13, p-value < 2.2e-16
# Subset residuals for Shapiro-Wilk test (up to 5000 samples)
residual_subset = sample(residuals(small_add_model), size = 5000)
shapiro.test(residual_subset)
##
## Shapiro-Wilk normality test
##
## data: residual_subset
## W = 0.70295, p-value < 2.2e-16
# Multicollinearity test
vif(small_add_model)
## GVIF Df GVIF^(1/(2*Df))
## year 2.008053 1 1.417058
## km_driven_in_10k 1.629810 1 1.276640
## fuel 2.550651 1 1.597076
## transmission 1.381643 1 1.175433
## owner 1.381414 4 1.041215
## fuel_efficiency 3.319208 1 1.821869
## engine 4.857050 1 2.203872
## max_power 2.999960 1 1.732039
## make_category 1.712363 2 1.143929
There are 14 parameters used in the small additive linear model.
The R^2 value of the model is 0.7097,which is less than the R^2 value of full model. That is 70.97% of the variance in selling_price_in_10k is explained by the predictors.
All predictors have a Variance Inflation Factor (VIF) < 5, indicating no multicollinearity among the predictors.
The Breusch-Pagan failed with test statistic value of BP = 1308.8 and very low p_value (< 2.2e-16), indicating that the null hypothesis is rejected and the constant variance assumption is violated
The Shapiro-Wilk normality test is violated with the test statistic value of W= 0.70295 and very low p_value (< 2.2e-16) indicating that the null hypothesis is rejected and the normality assumption is violated
The LOOCV RMSE of the model is 27.39708, which is low when compared to the range of values for the response variable selling price.
All the individual predictors are significant except for “ownerFourth & Above Owner”(p_value=0.29924) and fuel_efficiency(p_value = 0.02163)
anova(small_add_model, full_model)
## Analysis of Variance Table
##
## Model 1: selling_price_in_10k ~ (year + km_driven_in_10k + fuel + seller_type +
## transmission + owner + fuel_efficiency + engine + max_power +
## seats + make_category) - seller_type - seats
## Model 2: selling_price_in_10k ~ year + km_driven_in_10k + fuel + seller_type +
## transmission + owner + fuel_efficiency + engine + max_power +
## seats + make_category
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 5275 3848848
## 2 5272 3816027 3 32820 15.114 8.584e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the above Analysis of Variance(ANOVA) result we can say that Full additive model is statistically better than small additive model, based on the high F-test low p_value and lower RSS.
Full additive model has lower RSS (3816027) than small additive model (3848848), indicating that full additive model fits the data better.
Low p_value indicates that the small additive model is NOT sufficient in explaining the variance of the response variable.
Hence we reject the null hypothesis that H0: small additive model is sufficient in explaining the variance of the response variable.
From the ANOVA result we can see the Full additive linear model is better than small additive linear model.
But in both the above models we can see that the constant variance assumption of the residuals and normality assumptions of the residuals are violated.
max_power*fuel: This interaction term is between a numeric predictor “max_power” and a categorical predictor “fuel type”. This is used in the interaction model based on the results of scatter plot “Selling Price vs Max Power (by Fuel Type)” in the data analysis session.
transmission*seller_type: This interaction is between two categorical variables transmission type and seller type. This is used in the interaction model based on the results of interaction plot, as indicated below.
par(mfrow = c(2, 2))
interaction.plot(train_data$fuel, train_data$owner, train_data$selling_price_in_10k)
interaction.plot(train_data$fuel, train_data$transmission, train_data$selling_price_in_10k)
interaction.plot(train_data$transmission, train_data$owner, train_data$selling_price_in_10k)
interaction.plot(train_data$fuel, train_data$make_category, train_data$selling_price_in_10k)
par(mfrow = c(2, 2))
interaction.plot(train_data$make_category, train_data$owner, train_data$selling_price_in_10k)
interaction.plot(train_data$transmission, train_data$make_category, train_data$selling_price_in_10k)
interaction.plot(train_data$make_category, train_data$seller_type, train_data$selling_price_in_10k)
interaction.plot(train_data$seller_type, train_data$owner, train_data$selling_price_in_10k)
par(mfrow = c(1, 2))
interaction.plot(train_data$seller_type, train_data$fuel, train_data$selling_price_in_10k)
interaction.plot(train_data$transmission, train_data$seller_type, train_data$selling_price_in_10k)
n=nrow(train_data)
bic_int_model = step(lm(selling_price_in_10k ~ year + make_category + max_power*fuel + transmission*seller_type + owner + engine, data = train_data),direction="both",k=log(n),trace=0)
# Model summary
summary(bic_int_model)
##
## Call:
## lm(formula = selling_price_in_10k ~ year + make_category + max_power *
## fuel + transmission * seller_type + owner + engine, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -217.563 -10.459 -1.017 8.923 307.953
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -8.850e+03 2.256e+02 -39.233
## year 4.407e+00 1.118e-01 39.413
## make_categoryLuxury 9.106e+01 3.044e+00 29.915
## make_categoryMidrange 3.360e+00 7.940e-01 4.231
## max_power 5.972e-01 2.140e-02 27.899
## fuelPetrol 2.924e+01 2.463e+00 11.873
## transmissionManual -4.403e+01 2.884e+00 -15.270
## seller_typeIndividual -3.773e+01 2.798e+00 -13.485
## seller_typeTrustmark Dealer -3.012e+01 9.124e+00 -3.302
## ownerFourth & Above Owner -3.651e+00 2.597e+00 -1.406
## ownerSecond Owner -5.409e+00 8.981e-01 -6.023
## ownerTest Drive Car 2.596e+02 1.204e+01 21.568
## ownerThird Owner -4.007e+00 1.498e+00 -2.676
## engine 1.379e-02 1.163e-03 11.855
## max_power:fuelPetrol -4.182e-01 2.718e-02 -15.390
## transmissionManual:seller_typeIndividual 3.490e+01 3.108e+00 11.230
## transmissionManual:seller_typeTrustmark Dealer 3.783e+01 1.131e+01 3.345
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## year < 2e-16 ***
## make_categoryLuxury < 2e-16 ***
## make_categoryMidrange 2.36e-05 ***
## max_power < 2e-16 ***
## fuelPetrol < 2e-16 ***
## transmissionManual < 2e-16 ***
## seller_typeIndividual < 2e-16 ***
## seller_typeTrustmark Dealer 0.000967 ***
## ownerFourth & Above Owner 0.159804
## ownerSecond Owner 1.83e-09 ***
## ownerTest Drive Car < 2e-16 ***
## ownerThird Owner 0.007477 **
## engine < 2e-16 ***
## max_power:fuelPetrol < 2e-16 ***
## transmissionManual:seller_typeIndividual < 2e-16 ***
## transmissionManual:seller_typeTrustmark Dealer 0.000827 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26.25 on 5272 degrees of freedom
## Multiple R-squared: 0.726, Adjusted R-squared: 0.7252
## F-statistic: 873.1 on 16 and 5272 DF, p-value: < 2.2e-16
#Number of parameters
length(coef(bic_int_model))
## [1] 17
#Leave one out cross validation RMSE
sqrt(mean((resid(bic_int_model) / (1 - hatvalues(bic_int_model))) ^ 2))
## [1] 26.73532
# Diagnostic plots
par(mfrow = c(1, 2))
plot(bic_int_model,which=c(1,2))
# Diagnostic tests
bptest(bic_int_model)
##
## studentized Breusch-Pagan test
##
## data: bic_int_model
## BP = 1522.5, df = 16, p-value < 2.2e-16
# Subset residuals for Shapiro-Wilk test (up to 5000 samples)
residual_subset = sample(residuals(bic_int_model), size = 5000)
shapiro.test(residual_subset)
##
## Shapiro-Wilk normality test
##
## data: residual_subset
## W = 0.73061, p-value < 2.2e-16
# Multicollinearity test
vif(bic_int_model)
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
## GVIF Df GVIF^(1/(2*Df))
## year 1.436331 1 1.198470
## make_category 1.857596 2 1.167449
## max_power 3.513535 1 1.874443
## fuel 11.482762 1 3.388623
## transmission 5.157209 1 2.270949
## seller_type 15.303631 2 1.977874
## owner 1.432764 4 1.045976
## engine 2.536586 1 1.592666
## max_power:fuel 9.919899 1 3.149587
## transmission:seller_type 28.618175 2 2.312919
There are 17 parameters used in the interaction model after BIC step wise feature selection, which is relatively high.
The R^2 value of the model is 0.726,which is less than the R^2 value of full model. That is 72.6% of the variance in selling_price_in_10k is explained by the predictors.
The variables transmission, seller_type, fuel and max_power have a Variance Inflation Factor (VIF) > 5, indicating multicollinearity among all predictors used in the interaction term.
The Breusch-Pagan failed with very high test statistic value of BP = 1522.5 and very low p_value (< 2.2e-16), indicating that the null hypothesis is rejected and the constant variance assumption is violated
The Shapiro-Wilk normality test is violated with the test statistic value of W= 0.73061 and very low p_value (< 2.2e-16) indicating that the null hypothesis is rejected and the normality assumption is violated
The LOOCV RMSE of the model is 26.73532, which is low when compared to the range of values for the response variable selling price.
All the individual predictors are significant except for “ownerFourth & Above Owner”(p_value=0.159804)
train_data$selling_price_in_10k = log(train_data$selling_price_in_10k)
# Definition of full additive linear model with BIC-based stepwise feature selection starting from full model (both directions)
n= nrow(train_data)
bic_full_add_model_log = step(lm(selling_price_in_10k ~ ., data = train_data),direction="both",k=log(n),trace=0)
# Model summary
summary(bic_full_add_model_log)
##
## Call:
## lm(formula = selling_price_in_10k ~ year + km_driven_in_10k +
## fuel + seller_type + transmission + owner + fuel_efficiency +
## engine + max_power + seats + make_category, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.60851 -0.17245 0.01754 0.19210 1.35612
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.161e+02 3.003e+00 -71.946 < 2e-16 ***
## year 1.084e-01 1.506e-03 71.974 < 2e-16 ***
## km_driven_in_10k -4.752e-03 1.020e-03 -4.657 3.28e-06 ***
## fuelPetrol -1.456e-01 1.280e-02 -11.374 < 2e-16 ***
## seller_typeIndividual -5.924e-02 1.383e-02 -4.283 1.88e-05 ***
## seller_typeTrustmark Dealer 7.623e-02 5.896e-02 1.293 0.196
## transmissionManual -1.174e-01 1.643e-02 -7.144 1.03e-12 ***
## ownerFourth & Above Owner -1.192e-01 2.837e-02 -4.201 2.70e-05 ***
## ownerSecond Owner -7.388e-02 9.843e-03 -7.506 7.13e-14 ***
## ownerTest Drive Car 6.353e-01 1.296e-01 4.901 9.80e-07 ***
## ownerThird Owner -1.074e-01 1.639e-02 -6.551 6.27e-11 ***
## fuel_efficiency 1.477e-02 1.927e-03 7.669 2.05e-14 ***
## engine 2.269e-04 1.929e-05 11.759 < 2e-16 ***
## max_power 7.989e-03 2.222e-04 35.958 < 2e-16 ***
## seats 7.726e-02 6.550e-03 11.796 < 2e-16 ***
## make_categoryLuxury 4.540e-01 3.318e-02 13.681 < 2e-16 ***
## make_categoryMidrange 1.313e-01 9.036e-03 14.530 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2866 on 5272 degrees of freedom
## Multiple R-squared: 0.8518, Adjusted R-squared: 0.8514
## F-statistic: 1894 on 16 and 5272 DF, p-value: < 2.2e-16
#Number of parameters
length(coef(bic_full_add_model_log))
## [1] 17
#Leave one out cross validation RMSE
sqrt(mean((resid(bic_full_add_model_log) / (1 - hatvalues(bic_full_add_model_log))) ^ 2))
## [1] 0.2871987
# Diagnostic plots
par(mfrow = c(1, 2))
plot(bic_full_add_model_log,which=c(1,2))
# Diagnostic tests
bp_test = bptest(bic_full_add_model_log)
print(bp_test)
##
## studentized Breusch-Pagan test
##
## data: bic_full_add_model_log
## BP = 387.08, df = 16, p-value < 2.2e-16
# Subset residuals for Shapiro-Wilk test (up to 5000 samples)
residual_subset = sample(residuals(bic_full_add_model_log), size = 5000)
shapiro_test = shapiro.test(residual_subset)
print(shapiro_test)
##
## Shapiro-Wilk normality test
##
## data: residual_subset
## W = 0.98726, p-value < 2.2e-16
# Multicollinearity test
vif(bic_full_add_model_log)
## GVIF Df GVIF^(1/(2*Df))
## year 2.184945 1 1.478156
## km_driven_in_10k 1.643485 1 1.281985
## fuel 2.603160 1 1.613431
## seller_type 1.117854 2 1.028244
## transmission 1.404885 1 1.185278
## owner 1.407956 4 1.043695
## fuel_efficiency 3.679447 1 1.918188
## engine 5.853557 1 2.419412
## max_power 3.175790 1 1.782075
## seats 2.684514 1 1.638449
## make_category 1.929111 2 1.178526
There are 17 parameters used in the interaction model after BIC step wise feature selection, which is relatively high.
The full additive model after log transformation has R^2=0.8518, indicating that 85.18% of the variance in selling_price_in_10k is explained by the predictors.
All predictors have a Variance Inflation Factor (VIF) < 5, indicating no multicollinearity among the predictors.
The Breusch-Pagan failed with test statistic value of BP = 387.08 and very low p_value (< 2.2e-16), indicating that the null hypothesis is rejected and the constant variance assumption is violated
The Shapiro-Wilk normality test is violated with the test statistic value of W= 0.98726 and very low p_value (< 2.2e-16) indicating that the null hypothesis is rejected and the normality assumption is violated
The LOOCV RMSE of the model is 0.2871987, which is very low when compared to the range of values for the response variable selling price.
All the individual predictors are significant except for “seller_typeTrustmark Dealer”(p_value=0.196)
bic_small_add_model_log = step(lm(selling_price_in_10k ~ year + make_category + max_power + transmission + fuel + seller_type + owner +engine, data = train_data),direction="both",k=log(n),trace=0)
# Model summary
summary(bic_small_add_model_log)
##
## Call:
## lm(formula = selling_price_in_10k ~ year + make_category + max_power +
## transmission + fuel + seller_type + owner + engine, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.59461 -0.17061 0.01788 0.19184 1.33769
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.348e+02 2.493e+00 -94.197 < 2e-16 ***
## year 1.180e-01 1.236e-03 95.497 < 2e-16 ***
## make_categoryLuxury 3.908e-01 3.299e-02 11.845 < 2e-16 ***
## make_categoryMidrange 9.883e-02 8.720e-03 11.333 < 2e-16 ***
## max_power 7.378e-03 2.189e-04 33.711 < 2e-16 ***
## transmissionManual -1.035e-01 1.661e-02 -6.228 5.09e-10 ***
## fuelPetrol -1.863e-01 9.775e-03 -19.057 < 2e-16 ***
## seller_typeIndividual -6.535e-02 1.400e-02 -4.666 3.15e-06 ***
## seller_typeTrustmark Dealer 5.958e-02 5.989e-02 0.995 0.32
## ownerFourth & Above Owner -1.138e-01 2.880e-02 -3.952 7.86e-05 ***
## ownerSecond Owner -8.023e-02 9.961e-03 -8.055 9.77e-16 ***
## ownerTest Drive Car 6.275e-01 1.317e-01 4.764 1.94e-06 ***
## ownerThird Owner -1.157e-01 1.660e-02 -6.966 3.67e-12 ***
## engine 2.606e-04 1.284e-05 20.286 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2912 on 5275 degrees of freedom
## Multiple R-squared: 0.8469, Adjusted R-squared: 0.8466
## F-statistic: 2245 on 13 and 5275 DF, p-value: < 2.2e-16
#Number of parameters
length(coef(bic_small_add_model_log))
## [1] 14
#Leave one out cross validation RMSE
sqrt(mean((resid(bic_small_add_model_log) / (1 - hatvalues(bic_small_add_model_log))) ^ 2))
## [1] 0.2916997
# Diagnostic plots
par(mfrow = c(1, 2))
plot(bic_small_add_model_log,which=c(1,2))
# Diagnostic tests
bptest(bic_small_add_model_log)
##
## studentized Breusch-Pagan test
##
## data: bic_small_add_model_log
## BP = 390.86, df = 13, p-value < 2.2e-16
# Subset residuals for Shapiro-Wilk test (up to 5000 samples)
residual_subset = sample(residuals(bic_small_add_model_log), size = 5000)
shapiro.test(residual_subset)
##
## Shapiro-Wilk normality test
##
## data: residual_subset
## W = 0.98479, p-value < 2.2e-16
# Multicollinearity test
vif(bic_small_add_model_log)
## GVIF Df GVIF^(1/(2*Df))
## year 1.426062 1 1.194179
## make_category 1.708071 2 1.143211
## max_power 2.984259 1 1.727501
## transmission 1.390403 1 1.179153
## fuel 1.469690 1 1.212308
## seller_type 1.109027 2 1.026208
## owner 1.390298 4 1.042050
## engine 2.512669 1 1.585140
There are 14 parameters used in the interaction model after BIC step wise feature selection.
The small additive model after log transformation has R^2=0.8469, indicating that 84.69% of the variance in selling_price_in_10k is explained by the predictors.
All predictors have a Variance Inflation Factor (VIF) < 5, indicating no multicollinearity among the predictors.
The Breusch-Pagan failed with test statistic value of BP = 390.86 and very low p_value (< 2.2e-16), indicating that the null hypothesis is rejected and the constant variance assumption is violated
The Shapiro-Wilk normality test is violated with the test statistic value of W= 0.98479 and very low p_value (< 2.2e-16) indicating that the null hypothesis is rejected and the normality assumption is violated
The LOOCV RMSE of the model is 0.2916997, which is very low when compared to the range of values for the response variable selling price.
All the individual predictors are significant except for “seller_typeTrustmark Dealer”(p_value=0.32)
anova(bic_small_add_model_log, bic_full_add_model_log)
## Analysis of Variance Table
##
## Model 1: selling_price_in_10k ~ year + make_category + max_power + transmission +
## fuel + seller_type + owner + engine
## Model 2: selling_price_in_10k ~ year + km_driven_in_10k + fuel + seller_type +
## transmission + owner + fuel_efficiency + engine + max_power +
## seats + make_category
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 5275 447.40
## 2 5272 433.03 3 14.366 58.301 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
indicies_to_exclude = unname(which(cooks.distance(bic_small_add_model_log) > 4/length(cooks.distance(bic_small_add_model_log))))
train_data_filtered = train_data[-indicies_to_exclude,]
#Number of excluded observations
nrow(train_data) - nrow(train_data_filtered)
## [1] 275
n= nrow(train_data_filtered)
refitted_bic_small_add_model_log = lm(formula = formula(bic_small_add_model_log), data = train_data_filtered)
# Model summary
summary(refitted_bic_small_add_model_log)
##
## Call:
## lm(formula = formula(bic_small_add_model_log), data = train_data_filtered)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.91499 -0.16399 0.01396 0.17626 0.87203
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.385e+02 2.351e+00 -101.419 < 2e-16 ***
## year 1.198e-01 1.166e-03 102.774 < 2e-16 ***
## make_categoryLuxury 3.275e-01 3.318e-02 9.869 < 2e-16 ***
## make_categoryMidrange 7.794e-02 7.729e-03 10.084 < 2e-16 ***
## max_power 7.895e-03 2.014e-04 39.200 < 2e-16 ***
## transmissionManual -9.728e-02 1.499e-02 -6.491 9.37e-11 ***
## fuelPetrol -1.772e-01 8.716e-03 -20.333 < 2e-16 ***
## seller_typeIndividual -7.352e-02 1.250e-02 -5.880 4.37e-09 ***
## seller_typeTrustmark Dealer 9.465e-03 7.136e-02 0.133 0.89449
## ownerFourth & Above Owner -9.419e-02 3.216e-02 -2.929 0.00342 **
## ownerSecond Owner -6.530e-02 8.900e-03 -7.337 2.53e-13 ***
## ownerThird Owner -9.013e-02 1.543e-02 -5.840 5.56e-09 ***
## engine 2.717e-04 1.175e-05 23.120 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2529 on 5001 degrees of freedom
## Multiple R-squared: 0.8671, Adjusted R-squared: 0.8668
## F-statistic: 2719 on 12 and 5001 DF, p-value: < 2.2e-16
#Number of parameters
length(coef(refitted_bic_small_add_model_log))
## [1] 13
#Leave one out cross validation RMSE
sqrt(mean((resid(refitted_bic_small_add_model_log) / (1 - hatvalues(refitted_bic_small_add_model_log))) ^ 2))
## [1] 0.2531371
# Diagnostic plots
par(mfrow = c(1, 2))
plot(refitted_bic_small_add_model_log,which=c(1,2))
# Diagnostic tests
bptest(refitted_bic_small_add_model_log)
##
## studentized Breusch-Pagan test
##
## data: refitted_bic_small_add_model_log
## BP = 293.12, df = 12, p-value < 2.2e-16
# Subset residuals for Shapiro-Wilk test (up to 5000 samples)
residual_subset = sample(residuals(refitted_bic_small_add_model_log), size = 5000)
shapiro.test(residual_subset)
##
## Shapiro-Wilk normality test
##
## data: residual_subset
## W = 0.99396, p-value = 1.123e-13
# Multicollinearity test
vif(refitted_bic_small_add_model_log)
## GVIF Df GVIF^(1/(2*Df))
## year 1.415566 1 1.189776
## make_category 1.522897 2 1.110881
## max_power 2.696754 1 1.642180
## transmission 1.280517 1 1.131599
## fuel 1.470022 1 1.212445
## seller_type 1.094271 2 1.022778
## owner 1.351841 3 1.051528
## engine 2.418486 1 1.555148
There are 13 parameters used in the interaction model after BIC step wise feature selection.
The small refitted additive model after log transformation has R^2=0.8671, indicating that 86.71% of the variance in selling_price_in_10k is explained by the predictors.
All predictors have a Variance Inflation Factor (VIF) < 5, indicating no multicollinearity among the predictors.
The Breusch-Pagan failed with test statistic value of BP = 293.12 and very low p_value (< 2.2e-16), indicating that the null hypothesis is rejected and the constant variance assumption is violated
The Shapiro-Wilk normality test is violated with the test statistic value of W= 0.99396 and very low p_value (< 2.2e-16) indicating that the null hypothesis is rejected and the normality assumption is violated
The LOOCV RMSE of the model is 0.2531371, which is very low when compared to the range of values for the response variable selling price.
All the individual predictors are significant except for “seller_typeTrustmark Dealer”(p_value=0.89449)
fitted = fitted(refitted_bic_small_add_model_log)
weights = 1 / (fitted^2)
weighted_model = lm(formula=formula(refitted_bic_small_add_model_log),data=train_data_filtered,weights=weights)
# Model summary
summary(weighted_model)
##
## Call:
## lm(formula = formula(refitted_bic_small_add_model_log), data = train_data_filtered,
## weights = weights)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -0.30968 -0.04312 0.00466 0.04837 0.33746
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.291e+02 1.994e+00 -114.912 < 2e-16 ***
## year 1.152e-01 9.897e-04 116.376 < 2e-16 ***
## make_categoryLuxury 3.262e-01 4.715e-02 6.917 5.18e-12 ***
## make_categoryMidrange 7.087e-02 8.253e-03 8.588 < 2e-16 ***
## max_power 8.081e-03 2.252e-04 35.876 < 2e-16 ***
## transmissionManual -9.595e-02 1.809e-02 -5.303 1.19e-07 ***
## fuelPetrol -1.667e-01 9.262e-03 -17.993 < 2e-16 ***
## seller_typeIndividual -8.198e-02 1.432e-02 -5.725 1.09e-08 ***
## seller_typeTrustmark Dealer 8.932e-03 8.664e-02 0.103 0.918
## ownerFourth & Above Owner -1.288e-01 2.536e-02 -5.079 3.93e-07 ***
## ownerSecond Owner -7.539e-02 8.813e-03 -8.553 < 2e-16 ***
## ownerThird Owner -9.645e-02 1.384e-02 -6.970 3.58e-12 ***
## engine 2.626e-04 1.299e-05 20.219 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07434 on 5001 degrees of freedom
## Multiple R-squared: 0.8958, Adjusted R-squared: 0.8955
## F-statistic: 3582 on 12 and 5001 DF, p-value: < 2.2e-16
#Number of parameters
length(coef(weighted_model))
## [1] 13
#Leave one out cross validation RMSE
sqrt(mean((resid(weighted_model) / (1 - hatvalues(weighted_model))) ^ 2))
## [1] 0.2536943
# Diagnostic plots
par(mfrow = c(1, 2))
plot(weighted_model,which=c(1,2))
# Diagnostic tests
bptest(weighted_model)
##
## studentized Breusch-Pagan test
##
## data: weighted_model
## BP = 144.39, df = 12, p-value < 2.2e-16
# Subset residuals for Shapiro-Wilk test (up to 5000 samples)
residual_subset = sample(residuals(weighted_model), size = 5000)
shapiro.test(residual_subset)
##
## Shapiro-Wilk normality test
##
## data: residual_subset
## W = 0.99306, p-value = 6.801e-15
# Multicollinearity test
vif(weighted_model)
## GVIF Df GVIF^(1/(2*Df))
## year 1.541822 1 1.241701
## make_category 1.419449 2 1.091516
## max_power 2.715823 1 1.647975
## transmission 1.205858 1 1.098116
## fuel 1.595998 1 1.263328
## seller_type 1.076706 2 1.018648
## owner 1.359096 3 1.052467
## engine 2.560109 1 1.600034
There are 13 parameters used in the interaction model after BIC step wise feature selection.
The small refitted additive model after log transformation has R^2=0.8958, indicating that 89.58% of the variance in selling_price_in_10k is explained by the predictors.
All predictors have a Variance Inflation Factor (VIF) < 5, indicating no multicollinearity among the predictors.
The Breusch-Pagan failed with test statistic value of BP = 144.39 and very low p_value (< 2.2e-16), indicating that the null hypothesis is rejected and the constant variance assumption is violated
The Shapiro-Wilk normality test is violated with the test statistic value of W= 0.99306 and very low p_value (< 2.2e-16) indicating that the null hypothesis is rejected and the normality assumption is violated
The LOOCV RMSE of the model is 0.2536943, which is very low when compared to the range of values for the response variable selling price.
All the individual predictors are significant except for “seller_typeTrustmark Dealer”(p_value=0.918)
n=nrow(train_data)
bic_int_model_log = step(lm(selling_price_in_10k ~ (year + make_category + max_power + fuel + transmission + seller_type + owner + engine)^2, data = train_data),direction="both",k=log(n),trace=0)
# Model summary
summary(bic_int_model_log)
##
## Call:
## lm(formula = selling_price_in_10k ~ year + make_category + max_power +
## fuel + transmission + owner + engine + year:make_category +
## year:max_power + year:fuel + year:engine + make_category:max_power +
## make_category:fuel + make_category:transmission + make_category:engine +
## max_power:fuel + max_power:engine + fuel:engine, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.39319 -0.15527 0.02382 0.17903 1.09406
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.297e+02 9.750e+00 -23.563 < 2e-16
## year 1.152e-01 4.842e-03 23.790 < 2e-16
## make_categoryLuxury 5.886e+01 1.773e+01 3.320 0.000906
## make_categoryMidrange -8.295e+00 4.934e+00 -1.681 0.092783
## max_power -1.100e+00 1.120e-01 -9.824 < 2e-16
## fuelPetrol 5.786e+01 5.410e+00 10.694 < 2e-16
## transmissionManual -1.063e-01 2.319e-02 -4.586 4.63e-06
## ownerFourth & Above Owner -9.922e-02 2.686e-02 -3.694 0.000223
## ownerSecond Owner -6.833e-02 9.293e-03 -7.352 2.25e-13
## ownerTest Drive Car 4.369e-01 1.311e-01 3.333 0.000866
## ownerThird Owner -1.073e-01 1.547e-02 -6.939 4.42e-12
## engine 3.873e-02 7.375e-03 5.251 1.57e-07
## year:make_categoryLuxury -2.946e-02 8.786e-03 -3.353 0.000805
## year:make_categoryMidrange 3.877e-03 2.446e-03 1.585 0.113074
## year:max_power 5.547e-04 5.558e-05 9.981 < 2e-16
## year:fuelPetrol -2.872e-02 2.684e-03 -10.698 < 2e-16
## year:engine -1.897e-05 3.664e-06 -5.177 2.34e-07
## make_categoryLuxury:max_power -1.695e-03 1.238e-03 -1.370 0.170786
## make_categoryMidrange:max_power -3.766e-03 4.658e-04 -8.086 7.58e-16
## make_categoryLuxury:fuelPetrol 1.324e-01 9.359e-02 1.414 0.157295
## make_categoryMidrange:fuelPetrol 1.899e-01 2.054e-02 9.245 < 2e-16
## make_categoryLuxury:transmissionManual -3.519e-01 7.453e-02 -4.721 2.41e-06
## make_categoryMidrange:transmissionManual 1.562e-02 3.218e-02 0.485 0.627461
## make_categoryLuxury:engine 6.304e-04 1.085e-04 5.811 6.56e-09
## make_categoryMidrange:engine 5.636e-04 3.009e-05 18.731 < 2e-16
## max_power:fuelPetrol 3.266e-03 8.282e-04 3.944 8.13e-05
## max_power:engine -4.588e-06 3.289e-07 -13.947 < 2e-16
## fuelPetrol:engine -4.266e-04 6.460e-05 -6.605 4.38e-11
##
## (Intercept) ***
## year ***
## make_categoryLuxury ***
## make_categoryMidrange .
## max_power ***
## fuelPetrol ***
## transmissionManual ***
## ownerFourth & Above Owner ***
## ownerSecond Owner ***
## ownerTest Drive Car ***
## ownerThird Owner ***
## engine ***
## year:make_categoryLuxury ***
## year:make_categoryMidrange
## year:max_power ***
## year:fuelPetrol ***
## year:engine ***
## make_categoryLuxury:max_power
## make_categoryMidrange:max_power ***
## make_categoryLuxury:fuelPetrol
## make_categoryMidrange:fuelPetrol ***
## make_categoryLuxury:transmissionManual ***
## make_categoryMidrange:transmissionManual
## make_categoryLuxury:engine ***
## make_categoryMidrange:engine ***
## max_power:fuelPetrol ***
## max_power:engine ***
## fuelPetrol:engine ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2714 on 5261 degrees of freedom
## Multiple R-squared: 0.8675, Adjusted R-squared: 0.8668
## F-statistic: 1275 on 27 and 5261 DF, p-value: < 2.2e-16
#Number of parameters
length(coef(bic_int_model_log))
## [1] 28
#Leave one out cross validation RMSE
sqrt(mean((resid(bic_int_model_log) / (1 - hatvalues(bic_int_model_log))) ^ 2))
## [1] 0.2724199
# Diagnostic plots
par(mfrow = c(1, 2))
plot(bic_int_model_log,which=c(1,2))
# Diagnostic tests
bptest(bic_int_model_log)
##
## studentized Breusch-Pagan test
##
## data: bic_int_model_log
## BP = 426.19, df = 27, p-value < 2.2e-16
# Subset residuals for Shapiro-Wilk test (up to 5000 samples)
residual_subset = sample(residuals(bic_int_model_log), size = 5000)
shapiro.test(residual_subset)
##
## Shapiro-Wilk normality test
##
## data: residual_subset
## W = 0.98083, p-value < 2.2e-16
# Multicollinearity test
vif(bic_int_model_log)
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
## GVIF Df GVIF^(1/(2*Df))
## year 2.520117e+01 1 5.020076
## make_category 2.055362e+11 2 673.320845
## max_power 8.999606e+05 1 948.662521
## fuel 5.185422e+05 1 720.098743
## transmission 3.121457e+00 1 1.766765
## owner 1.597883e+00 4 1.060335
## engine 9.542579e+05 1 976.861272
## year:make_category 2.043151e+11 2 672.318594
## year:max_power 8.999602e+05 1 948.662324
## year:fuel 5.174020e+05 1 719.306576
## year:engine 9.542519e+05 1 976.858189
## make_category:max_power 3.705888e+03 2 7.802308
## make_category:fuel 8.709786e+00 2 1.717916
## make_category:transmission 2.416656e+01 2 2.217194
## make_category:engine 3.782037e+03 2 7.842083
## max_power:fuel 8.622232e+01 1 9.285598
## max_power:engine 7.362371e+01 1 8.580426
## fuel:engine 1.080291e+02 1 10.393704
There are 28 parameters used in the interaction model after BIC step wise feature selection.
The small refitted additive model after log transformation has R^2=0.8675, indicating that 86.75% of the variance in selling_price_in_10k is explained by the predictors.
Many interactipon variables engine, fuel year make_ctegory and max_power have a Variance Inflation Factor (VIF) > 5, indicating multicollinearity among all predictors used in the interaction term.
The Breusch-Pagan failed with test statistic value of BP = 144.39 and very low p_value (< 2.2e-16), indicating that the null hypothesis is rejected and the constant variance assumption is violated
The Shapiro-Wilk normality test is violated with the test statistic value of W= 0.99306 and very low p_value (< 2.2e-16) indicating that the null hypothesis is rejected and the normality assumption is violated
The LOOCV RMSE of the model is 0.2536943, which is very low when compared to the range of values for the response variable selling price.
All the individual predictors are significant except for “seller_typeTrustmark Dealer”(p_value=0.918)
| Model Description | Adjusted R^2 | Number of parameters |
Breusch-Pagan test | Shapiro-Wilk test (with 5k sample residuals) |
VIF>5 | LOOCV RMSE |
|
|---|---|---|---|---|---|---|---|
| 1 | Full additive model | 0.7122 | 17 | Constant variance assumption violated(BP = 1310) |
Normality assumption vioalted with W = 0.71385 |
None | 27.30411 |
| 2 | Small additive model | 0.7097 | 14 | Constant variance assumption violated(BP = 1308.8) |
Normality assumption vioalted with W = 0.70295 |
None | 27.39708 |
| 3 | Interaction model | 0.726 | 17 | Constant variance assumption violated(BP = 1522.5) |
Normality assumption vioalted with W = 0.73061 |
transmission seller_type fuel max_power |
26.73532 |
| Model Description | Adjusted R^2 | Number of parameters |
Breusch-Pagan test | Shapiro-Wilk test (with 5k sample residuals) |
VIF>5 | LOOCV RMSE |
|
|---|---|---|---|---|---|---|---|
| 1 | Full additive model with BIC | 0.8518 | 17 | Constant variance assumption violated(BP = 387.08) |
Normality assumption vioalted with W = 0.98726 |
None | 0.2871987 |
| 2 | Small additive model with BIC | 0.8469 | 14 | Constant variance assumption violated(BP = 390.86) |
Normality assumption vioalted with W = 0.98479 |
None | 0.2916997 |
| 3 | Refitted small additive model without influential points |
0.8671 | 13 | Constant variance assumption violated(BP = 293.12) |
Normality assumption vioalted with W = 0.99396 |
None | 0.2531371 |
| 4 | Small additive BIC weighted model | 0.8958 | 13 | Constant variance assumption violated(BP = 144.39) |
Normality assumption vioalted with W = 0.99306 |
None | 0.2536943 |
| 5 | Full Interaction BIC model | 0.8675 | 28 | Constant variance assumption violated(BP = 426.19) |
Normality assumption vioalted with W = 0.98083 |
engine, fuel year, make_ctegory, max_power |
0.2724199 |
price_predictions_with_log = predict(weighted_model, newdata = test_data)
price_predictions = exp(price_predictions_with_log)
TEST_RMSE = sqrt(mean((test_data$selling_price_in_10k - price_predictions)^2))
TEST_RMSE
## [1] 18.64312
sst = sum((test_data$selling_price_in_10k - mean(test_data$selling_price_in_10k))^2)
sse = sum((test_data$selling_price_in_10k - price_predictions)^2)
TEST_R_SQRD = 1 - (sse / sst)
TEST_R_SQRD
## [1] 0.8731878
test_residuals = test_data$selling_price_in_10k - price_predictions
plot(price_predictions, test_residuals,
xlab = "Predicted Selling Price",
ylab = "Residuals",
main = "Residuals vs Predicted Values",
pch = 16, col = "blue")
abline(h = 0, col = "red", lwd = 2)